175904 - Jorge III Altamirano Astorga
Yo mantuve el cluster que tengo en una máquina, por lo que opté esta opción, en vez de Amazon Web Services. Esta opción es mucho más conveniente, dado que quien me patrocina (Peñoles) va a optar por on-prem en vez.
Los nodos mantienen por esta razón los nombres de los integrantes de mi equipo :-)
Los datos de northwind se encuentran en dropbox
Modifica el ejercicio de Rank para que en lugar de obtener el id del producto con mejor rank obtengamos el nombre del producto -requieres un join- con el mejor rank
El ID es el 45, el cual corresponde a Raclette Courdavault.
customers = load '/data/northwind/customers.csv' using PigStorage(',') as (
customerid:chararray, companyname:chararray, contactname:chararray,
contacttitle:chararray, address:chararray, city:chararray,
region:chararray, postalcode:chararray, country:chararray,
phone:chararray, fax:chararray
);
employees = load '/data/northwind/employees.csv' using PigStorage(',') as (
employeeid:int, lastname:chararray, firstname:chararray,
title:chararray, titleofcourtesy:chararray, birthdate:chararray,
hiredate:chararray, address:chararray, city:chararray,
region:chararray, postalcode:chararray, country:chararray,
homephone:chararray, extension:chararray, photo:chararray,
notes:chararray, reportsto:chararray, photopath:chararray
);
orderdetails = load '/data/northwind/order_details.csv' using PigStorage(',') as (
orderid:int, productid:int, unitprice:float, quantity:int,
discount:float
);
--- didn't implement as datetime due to data loss prevention :-)
orders = load '/data/northwind/orders.csv' using PigStorage(',') as (
orderid:int, customerid:chararray, employeeid:int,
orderdate:chararray, requireddate:chararray, shippeddate:chararray,
shipvia:int, freight:float, shipname:chararray,
shipaddress:chararray, shipcity:chararray, shipregion:chararray,
shippostalcode:chararray, shipcountry:chararray
);
products = load '/data/northwind/products.csv' using PigStorage(',') as (
productid:int, productname:chararray, supplierid:int,
categoryid:int, quantityperunit:chararray, unitprice:float,
unitsinstock:int, unitsonorder:int, reorderlevel:int,
discontinued:boolean
);
group_orders = group orderdetails by productid;
count_products = FOREACH group_orders GENERATE group as productid, COUNT($1) as n;
ranked = rank count_products by n DESC;
limited_rank = limit ranked 1;
out = JOIN products by productid, limited_rank by productid;
store out into '/tarea_4/ejercicio_a' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
products::productid,products::productname,products::supplierid,products::categoryid,products::quantityperunit,products::unitprice,products::unitsinstock,products::unitsonorder,products::reorderlevel,products::discontinued,limited_rank::rank_count_products,limited_rank::productid,limited_rank::n
59,Raclette Courdavault,28,4,5 kg pkg.,55.0,79,0,0,,1,59,54
Con los datos de aerolíneas, vuelos y aeropuertos que se encuentran en el dropbox y utilizando Pig contesta las siguietnes preguntas:
Existen 43157 con destino a Honolulu International Airport (HNL).
flights = load '/data/flights/flights.csv' using PigStorage(',') as (
year:int, month:int, day:int, day_of_week:int,
airline:chararray, flight_number:int, tail_number:chararray,
origin_airport:chararray, destination_airport:chararray,
scheduled_departure:int, departure_time:int,
departure_delay:long, taxi_out:int, wheels_off:int,
scheduled_time:int, elapsed_time:int, air_time:int,
distance:int, wheels_on:int, taxi_in:int,
scheduled_arrival:int, arrival_time:int, arrival_delay:int,
diverted:int, cancelled:int, cancellation_reason:chararray,
air_system_delay:chararray, security_delay:chararray,
airline_delay:chararray, late_aircraft_delay:chararray,
weather_delay:chararray);
airports = load '/data/flights/airports.csv' using PigStorage(',') as (
iata_code:chararray, airport:chararray, city:chararray,
state:chararray, country:chararray,
latitude:double, longitude:double);
airlines = load '/data/flights/airlines.csv' using PigStorage(',') as (
iata_code:chararray, airline:chararray);
airlines_airport_arr = JOIN flights by destination_airport, airports by iata_code;
airlines_filter = FILTER airlines_airport_arr BY airport matches 'Honolulu International Airport';
airlines_group = GROUP airlines_filter BY airport;
airlines_count = FOREACH airlines_group GENERATE COUNT($1) AS n_flights, group, flatten(airlines_filter.iata_code) AS iata_code;
airlines_limit = LIMIT airlines_count 1;
store airlines_limit into '/tarea_4/ejercicio_b.1' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
n_flights,group,iata_code
43157,Honolulu International Airport,HNL
La aerolínea con el vuelo con mayor retraso es de American Airlines (AA).
flights = load '/data/flights/flights.csv' using PigStorage(',') as (
year:int, month:int, day:int, day_of_week:int,
airline:chararray, flight_number:int, tail_number:chararray,
origin_airport:chararray, destination_airport:chararray,
scheduled_departure:int, departure_time:int,
departure_delay:long, taxi_out:int, wheels_off:int,
scheduled_time:int, elapsed_time:int, air_time:int,
distance:int, wheels_on:int, taxi_in:int,
scheduled_arrival:int, arrival_time:int, arrival_delay:int,
diverted:int, cancelled:int, cancellation_reason:chararray,
air_system_delay:chararray, security_delay:chararray,
airline_delay:chararray, late_aircraft_delay:chararray,
weather_delay:chararray);
airports = load '/data/flights/airports.csv' using PigStorage(',') as (
iata_code:chararray, airport:chararray, city:chararray,
state:chararray, country:chararray,
latitude:double, longitude:double);
airlines = load '/data/flights/airlines.csv' using PigStorage(',') as (
iata_code:chararray, airline:chararray);
flights_delay = FOREACH flights GENERATE flight_number AS flight_number,
airline AS airline, arrival_delay AS arrival_delay;
flights_order = ORDER flights_delay BY arrival_delay DESC;
flights_limit = LIMIT flights_order 1;
flights_join = JOIN flights_limit by airline, airlines by iata_code;
--- airlines_airport_arr = JOIN flights by destination_airport, airports by iata_code;
--- airlines_filter = FILTER airlines_airport_arr BY airport matches 'Honolulu International Airport';
store flights_join into '/tarea_4/ejercicio_b.2' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
flights_limit::flight_number,flights_limit::airline,flights_limit::arrival_delay,airlines::iata_code,airlines::airline
1322,AA,1971,AA,American Airlines Inc.
El primer día de la semana (Lunes o el Domingo, dependiendo del contexto, pero se puede interpretar que es Lunes, por los estándares internacionales) con 21,073 vuelos cancelados.
flights = load '/data/flights/flights.csv' using PigStorage(',') as (
year:int, month:int, day:int, day_of_week:int,
airline:chararray, flight_number:int, tail_number:chararray,
origin_airport:chararray, destination_airport:chararray,
scheduled_departure:int, departure_time:int,
departure_delay:long, taxi_out:int, wheels_off:int,
scheduled_time:int, elapsed_time:int, air_time:int,
distance:int, wheels_on:int, taxi_in:int,
scheduled_arrival:int, arrival_time:int, arrival_delay:int,
diverted:int, cancelled:int, cancellation_reason:chararray,
air_system_delay:chararray, security_delay:chararray,
airline_delay:chararray, late_aircraft_delay:chararray,
weather_delay:chararray);
airports = load '/data/flights/airports.csv' using PigStorage(',') as (
iata_code:chararray, airport:chararray, city:chararray,
state:chararray, country:chararray,
latitude:double, longitude:double);
airlines = load '/data/flights/airlines.csv' using PigStorage(',') as (
iata_code:chararray, airline:chararray);
flights_cancel = FILTER flights BY cancelled == 1;
flights_group = GROUP flights_cancel BY day_of_week;
flights_delay = FOREACH flights_group GENERATE group AS day_of_week, COUNT($1) AS n_cancelled;
flights_order = ORDER flights_delay BY n_cancelled DESC;
flights_limit = LIMIT flights_order 1;
store flights_limit into '/tarea_4/ejercicio_b.3' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
day_of_week,n_cancelled
1,21073
Son:
flights = load '/data/flights/flights.csv' using PigStorage(',') as (
year:int, month:int, day:int, day_of_week:int,
airline:chararray, flight_number:int, tail_number:chararray,
origin_airport:chararray, destination_airport:chararray,
scheduled_departure:int, departure_time:int,
departure_delay:long, taxi_out:int, wheels_off:int,
scheduled_time:int, elapsed_time:int, air_time:int,
distance:int, wheels_on:int, taxi_in:int,
scheduled_arrival:int, arrival_time:int, arrival_delay:int,
diverted:int, cancelled:int, cancellation_reason:chararray,
air_system_delay:chararray, security_delay:chararray,
airline_delay:chararray, late_aircraft_delay:chararray,
weather_delay:chararray);
airports = load '/data/flights/airports.csv' using PigStorage(',') as (
iata_code:chararray, airport:chararray, city:chararray,
state:chararray, country:chararray,
latitude:double, longitude:double);
airlines = load '/data/flights/airlines.csv' using PigStorage(',') as (
iata_code:chararray, airline:chararray);
flights_cancel = FILTER flights BY cancelled == 1;
flights_group = GROUP flights_cancel BY origin_airport;
flights_4each = FOREACH flights_group GENERATE group AS origin_airport, COUNT($1) AS n_cancelled;
flights_airpr = FILTER flights_4each BY n_cancelled == 17;
flights_join = JOIN flights_airpr BY origin_airport, airports BY iata_code;
store flights_join into '/tarea_4/ejercicio_b.4' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
flights_airpr::origin_airport,flights_airpr::n_cancelled,airports::iata_code,airports::airport,airports::city,airports::state,airports::country,airports::latitude,airports::longitude
DHN,17,DHN,Dothan Regional Airport,Dothan,AL,USA,31.32134,-85.44963
DIK,17,DIK,Dickinson Theodore Roosevelt Regional Airport,Dickinson,ND,USA,46.79739,-102.80195
ESC,17,ESC,Delta County Airport,Escanaba,MI,USA,45.72267,-87.09373
El aeropuerto es Chicago O’Hare International Airport (ORD).
flights = load '/data/flights/flights.csv' using PigStorage(',') as (
year:int, month:int, day:int, day_of_week:int,
airline:chararray, flight_number:int, tail_number:chararray,
origin_airport:chararray, destination_airport:chararray,
scheduled_departure:int, departure_time:int,
departure_delay:long, taxi_out:int, wheels_off:int,
scheduled_time:int, elapsed_time:int, air_time:int,
distance:int, wheels_on:int, taxi_in:int,
scheduled_arrival:int, arrival_time:int, arrival_delay:int,
diverted:int, cancelled:int, cancellation_reason:chararray,
air_system_delay:chararray, security_delay:chararray,
airline_delay:chararray, late_aircraft_delay:chararray,
weather_delay:chararray);
airports = load '/data/flights/airports.csv' using PigStorage(',') as (
iata_code:chararray, airport:chararray, city:chararray,
state:chararray, country:chararray,
latitude:double, longitude:double);
airlines = load '/data/flights/airlines.csv' using PigStorage(',') as (
iata_code:chararray, airline:chararray);
SET DEFAULT_PARALLEL 5;
flights_cancel = FILTER flights BY cancelled == 1;
flights_group = GROUP flights_cancel BY origin_airport;
flights_4each = FOREACH flights_group GENERATE group AS origin_airport, COUNT($1) AS n_cancelled;
flights_order = ORDER flights_4each BY n_cancelled DESC;
flights_limit = LIMIT flights_order 1;
flights_join = JOIN flights_limit BY origin_airport, airports BY iata_code;
store flights_join into '/tarea_4/ejercicio_b.5' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
flights_limit::origin_airport,flights_limit::n_cancelled,airports::iata_code,airports::airport,airports::city,airports::state,airports::country,airports::latitude,airports::longitude
ORD,8548,ORD,Chicago O'Hare International Airport,Chicago,IL,USA,41.9796,-87.90446
Es el vuelo 202 con 46 destinos:
flights = load '/data/flights/flights.csv' using PigStorage(',') as (
year:int, month:int, day:int, day_of_week:int,
airline:chararray, flight_number:int, tail_number:chararray,
origin_airport:chararray, destination_airport:chararray,
scheduled_departure:int, departure_time:int,
departure_delay:long, taxi_out:int, wheels_off:int,
scheduled_time:int, elapsed_time:int, air_time:int,
distance:int, wheels_on:int, taxi_in:int,
scheduled_arrival:int, arrival_time:int, arrival_delay:int,
diverted:int, cancelled:int, cancellation_reason:chararray,
air_system_delay:chararray, security_delay:chararray,
airline_delay:chararray, late_aircraft_delay:chararray,
weather_delay:chararray);
airports = load '/data/flights/airports.csv' using PigStorage(',') as (
iata_code:chararray, airport:chararray, city:chararray,
state:chararray, country:chararray,
latitude:double, longitude:double);
SET DEFAULT_PARALLEL 10;
flights_4each = FOREACH flights GENERATE flight_number, dest;
flights_group = GROUP flights BY flight_number;
flights_disti = FOREACH flights_group {
DA = DISTINCT flights.destination_airport;
GENERATE
group AS flight_number,
DA as destinations,
COUNT(DA) AS n_destinations; };
flights_order = ORDER flights_disti BY n_destinations DESC;
flights_limit = LIMIT flights_order 1;
flights_flatt = FOREACH flights_limit GENERATE
flight_number, n_destinations,
flatten(destinations) AS destination;
flights_join = JOIN flights_flatt BY destination, airports BY iata_code;
flights_out = FOREACH flights_join GENERATE
flight_number, n_destinations, destination, airport, city, state, country;
store flights_out into '/tarea_4/ejercicio_b.6' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
flights_flatt::flight_number,flights_flatt::n_destinations,flights_flatt::destination,airports::airport,airports::city,airports::state,airports::country
202,68,ATL,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA
202,68,AUS,Austin-Bergstrom International Airport,Austin,TX,USA
202,68,BDL,Bradley International Airport,Windsor Locks,CT,USA
202,68,BNA,Nashville International Airport,Nashville,TN,USA
202,68,BOS,Gen. Edward Lawrence Logan International Airport,Boston,MA,USA
202,68,BWI,Baltimore-Washington International Airport,Baltimore,MD,USA
202,68,CAK,Akron-Canton Regional Airport,Akron,OH,USA
202,68,CLE,Cleveland Hopkins International Airport,Cleveland,OH,USA
202,68,DCA,Ronald Reagan Washington National Airport,Arlington,VA,USA
202,68,DEN,Denver International Airport,Denver,CO,USA
202,68,DTW,Detroit Metropolitan Airport,Detroit,MI,USA
202,68,EWR,Newark Liberty International Airport,Newark,NJ,USA
202,68,FLL,Fort Lauderdale-Hollywood International Airport,Ft. Lauderdale,FL,USA
202,68,FNT,Bishop International Airport,Flint,MI,USA
202,68,IAD,Washington Dulles International Airport,Chantilly,VA,USA
202,68,IAH,George Bush Intercontinental Airport,Houston,TX,USA
202,68,IND,Indianapolis International Airport,Indianapolis,IN,USA
202,68,ITO,Hilo International Airport,Hilo,HI,USA
202,68,JFK,John F. Kennedy International Airport (New York International Airport),New York,NY,USA
202,68,LAS,McCarran International Airport,Las Vegas,NV,USA
202,68,LAX,Los Angeles International Airport,Los Angeles,CA,USA
202,68,LGA,LaGuardia Airport (Marine Air Terminal),New York,NY,USA
202,68,MCO,Orlando International Airport,Orlando,FL,USA
202,68,MDW,Chicago Midway International Airport,Chicago,IL,USA
202,68,MKE,General Mitchell International Airport,Milwaukee,WI,USA
202,68,MSN,Dane County Regional Airport,Madison,WI,USA
202,68,MSP,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA
202,68,OAK,Oakland International Airport,Oakland,CA,USA
202,68,OGG,Kahului Airport,Kahului,HI,USA
202,68,OKC,Will Rogers World Airport,Oklahoma City,OK,USA
202,68,ORD,Chicago O'Hare International Airport,Chicago,IL,USA
202,68,PDX,Portland International Airport,Portland,OR,USA
202,68,PHL,Philadelphia International Airport,Philadelphia,PA,USA
202,68,PHX,Phoenix Sky Harbor International Airport,Phoenix,AZ,USA
202,68,PIT,Pittsburgh International Airport,Pittsburgh,PA,USA
202,68,RDU,Raleigh-Durham International Airport,Raleigh,NC,USA
202,68,RNO,Reno/Tahoe International Airport,Reno,NV,USA
202,68,RSW,Southwest Florida International Airport,Ft. Myers,FL,USA
202,68,SAN,San Diego International Airport (Lindbergh Field),San Diego,CA,USA
202,68,SAT,San Antonio International Airport,San Antonio,TX,USA
202,68,SEA,Seattle-Tacoma International Airport,Seattle,WA,USA
202,68,SFO,San Francisco International Airport,San Francisco,CA,USA
202,68,SLC,Salt Lake City International Airport,Salt Lake City,UT,USA
202,68,SMF,Sacramento International Airport,Sacramento,CA,USA
202,68,TPA,Tampa International Airport,Tampa,FL,USA